Storage Models & Compression

Database Workloads

On-Line Transaction Processing (OLTP, 面向事务)

只进行简单的读写，侧重写，负载低(,Fast operations that only read/update a small amount of data each time,)

On-Line Analytical Processing (OLAP, 面向分析)

通过聚合的方式进行复杂的数据分析，侧重读，负载高(,Complex queries that read a lot of data to compute aggregates,)

Hybrid Transaction + Analytical Processing

对上述两种模式的混合(,OLTP + OLAP together on the same database instance,)

一言以蔽之

行存储适合 OLTP，列存储适合 OLAP

N-Ary Storage Model (aka "Row Storage"(,行存储,))

行存储对于 OLTP 场景效果较好。

在 OLTP 下，常常需要增删改查整行记录，行存储将行上的数据存在相邻的位置，对于此种场景性能较好。

但相对地，在 OLAP 场景下，由于通常只需用到表中的几个列，却需要读取全部的列数据，因此在 OLAP 下行存储性能较差。

Advantages

Fast inserts, updates, and deletes.
Good for queries that need the entire tuple.

Disadvantages

Not good for scanning large portions of the table and/or a subset of the attributes.

Decomposition Storage Model (aka "Column Storage"(,列存储,))

将表中所有 tuple 的同一字段的数据进行连续存储（如存在相同的页上）。

对于 OLAP 场景性能较好，但由于要获取整个 tuple 的数据时需要，对不同的字段进行多次查找，对于数据增删改查的场景性能较差。

Advantages

Reduces the amount wasted I/O because the DBMS only reads the data that it needs.
Better query processing and data compression (more on this later).

Disadvantages

Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.

Tuple Identification

对于列存储，获取特定 tuple 的方式

Choice #1: Fixed-length Offsets

对一个字段，每个值占用相同的空间，这样就可以通过偏移量获取指定位置的值(,Each value is the same length for an attribute,)

Choice #2: Embedded Tuple Ids (由于性能较低（线性时间复杂度）较少使用)

Each value is stored with its tuple id in a column.

Compression

由于硬盘的 I/O 性能限制，DBMS 可以通过压缩数据减少读写的数据量以获得更好的 I/O 性能。

Goals

Must produce fixed-length values.

Only exception is var-length data stored in separate pool.

Postpone decompression for as long as possible during query execution (按需解压缩).

Also known as late materialization.

Must be a lossless scheme.

Compression Granularity(,粒度,)

Choice #1: Block-level

Compress a block of tuples for the same table.

Choice #2: Tuple-level

Compress the contents of the entire tuple (NSM-only).

Choice #3: Attribute-level

Compress a single attribute within one tuple (overflow).
Can target multiple attributes for the same tuple.

Choice #4: Column-level

Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).

Columnar Compression(,面向列的压缩,)

Run-length Encoding

对于高度分类的字段较为有效，如性别

一个最简单的例子：

AAABBBCCDDDD -> A3B3C2D4

你也可以使用 RLE 三元组：

Value
Offset
Length

AAABBBCCDDDD -> (A, 0, 3)(B, 3, 3)(C, 6, 2)(D, 8, 4)

Bit-Packing Encoding

对于一些类型的数据，可以使用占用空间较小的类型进行存储

例如：

对于一个存储数字的字段，默认使用 int64 类型存储

如果该字段存储的值实际占用的空间都小于 int64 类型的实际空间

则可以使用占用空间更小的类型，如 int8、int16 进行存储

Mostly Encoding

如果一个字段中大多数(,Mostly,)的值都可以用小类型表示，可以将少数的必须用大类型表示的值放在一张单独的查询表中，在原位留下指针或偏移量，而其它的值就可以使用小类型表示。

Bitmap Encoding

使用一个位图(,Bitmap,)来表示一个字段中所有可能的值在每一个 tuple 中的实际值，由于每一位实际只占用一个 bit，因此更节约空间。

但对于可能值较多的字段，使用 Bitmap Encoding 可能会占用更多空间。

例如：

假设有一张性别表:

ID	Gender
1	M
2	M
3	M
4	F
5	M
6	F
7	M
8	M

使用 Bitmap Encoding 则可以将表优化为：

ID	Gender: M	Gender: F
1	1	0
2	1	0
3	1	0
4	0	1
5	1	0
6	0	1
7	1	0
8	1	0

Delta Encoding

只记录与上一 tuple 的数据差(,Delta,)而不记录完整数据

将基准值内联或记录在单独的查询表上
配合 RLE(,Run-length Encoding,) 使用以获取更高的压缩率

例如：

Time	Temp
12:00	99.5
12:01	99.4
12:02	99.5
12:03	99.6
12:04	99.4

在压缩之后：

12:00	99.5
+1	-0.1
+1	+0.1
+1	+0.1
+1	-0.2

配合 RLE 使用：

Time	Temp
12:00	99.5
(+1,4)	-0.1
	+0.1
	+0.1
	-0.2

Incremental Encoding

对于一组有规律的字符串，找出其中的最长公共前缀，通过记录前缀长度和字符串后缀来减少空间占用

例如：

Original
rob
robbed
robbing
robot

找出最长前缀：robb，压缩后：

Prefix Length	Suffix
0	rob
3	bed
4	ing
3	ot

Dictionary Encoding

用一张表将不定长度的值映射到占用空间较小的整型标识符

通常，标识符和值一一对应
Fast encoding and decoding
Support range queries

Two operations that is needed to support:

Encode/Locate: convert a given value into its compressed form
Decode/Extract: convert a given compressed into its original form

笔记
Dictionary Encoding 中通常不使用哈希算法来获取标识符，因为哈希算法会破坏数据原有的顺序。

Order-Preserving Encoding

在一些查询场景下，可以直接访问映射表进行查询而不经过原来的数据列

点此查看原文